Abstract

A major challenge for the realization of intelligent 
robots is to supply them with cognitive abilities in 
order to allow ordinary users to program them easily
and intuitively. One way to do so is to teach work
tasks by interactive demonstration. To make this
effective and convenient for the user, the machine
must be capable of establishing a common focus of
attention and be able to use and integrate spoken
instructions, visual perceptions, and non-verbal
cues like gestural commands. We report progress
in building a hybrid architecture that combines sta- 
tistical methods, neural networks, and finite state 
machines into an integrated system for instructing 
grasping tasks by man-machine interaction. The sys- 
tem combines the GRAVIS-robot for visual attention 
and gestural instruction with an intelligent interface 
for speech recognition and linguistic interpretation, 
and a modality fusion module to allow multi-modal
task-oriented man-machine communication with respect
to dextrous robot manipulation of objects.



1 Introduction 

In recent years a new generation of intelligent robots 
has found applications in natural environments like 
museums, hospitals, or private households. While 
conventional programming can be efficient for factory 
floor applications, more cognitively oriented robots 
must be instructable by ordinary human users in a 
robust and intuitive way. In this respect, one way to 
program a work task is by interactive human demon- 
stration, which requires the endowment of a robot 
with sufficient perceptual, cognitive, and motor skills 
to communicate with the user in a natural fashion. 
As humans inevitably use different modalities in
interpersonal and man-machine communication, an
intelligent robot system should take advantage of this
information by using and integrating different
perceptual channels. In this paper, we present a
combination and integration of active vision, gestural
instruction, and speech input to instruct a robot
system for grasping tasks (Fig. 1). Though parts of the
functional modules have been described and
evaluated as standalone applications in more detail
earlier, their integration into a full scale
architecture is described here for the first time and has
proven to be a major challenge due to the enormous 
complexity of the overall system. Therefore we focus 
on the architecture and module interconnections and 
highlight some lessons learnt from building such an 
interactive system. As a whole, the described project 
is part of a larger research effort (Bielefeld Special
Collaborative Research Unit SFB 360) aiming
at the development of "situated artificial
communicators" that can be interacted with in a natural,
"human-like" fashion with the combined use of
verbal and non-verbal instructions. It is in line with
earlier work devoted to robot teaching by showing
and imitation learning. While there has been
much work on various aspects of learning in
cognitive architectures (speech and image integration [17],
trajectory acquisition [5, 14], object recognition
and grasp pose determination, or sensor fusion for
grasping), the design of an integrated architecture
is widely believed to be very hard to achieve. Thus
only a few advanced architectures have been developed
that are capable of integrating perceptual
attention mechanisms with higher-level functions.

Figure 1: The interactive scenario.

The next sections provide an overview of the over- 
all system and its highest level building blocks, with 
special emphasis on their mutual interactions. We 
then demonstrate some of the system's capabilities 
and discuss and illustrate the idea that there exists 
a "critical level of skills" from which development of 
the system towards more complex capabilities pro- 
gresses much faster. 
Figure 2: Modules of the integrated architecture.



2 System Architecture 

The architecture design is one of the key issues in 
realizing a complex intelligent robot system. From 
an ideal perspective, a common uniform software 
framework should be specified beforehand to sup- 
port a subsequent distributed development of mod- 
ules according to certain specifications. Different 
approaches like behavior based architectures, agent- 
based concepts or blackboard systems have been pro- 
posed in this context. 

However, in a truly complex system very different 
types of signals are generated at different time scales 
and require many sub-skills to be developed under 
diverse programming paradigms. In Section 8 we 
discuss further reasons why from our experience it 
is unreasonable to impose strong constraints on the 
submodules for easier software engineering. As a 
consequence, we find that it is rather the level of the 
architecture which has to support the integration of 
heterogeneous components. 

Our entire system is implemented as a large number
of separate processes running in parallel on several
workstations and communicating via the distributed
architecture communication system (DACS),
developed earlier for the purpose of this project.
The submodules use different programming
languages (C, C++, Tcl/Tk, Neo/NST), various
visualization tools, and a variety of processing paradigms
ranging from a neurally inspired attention system to
statistical and declarative methods for inference and
knowledge representation. Thus the architecture as
a whole cannot be easily subsumed under any single
one of the programming paradigms mentioned above.
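
The loose coupling described here can be illustrated with a minimal publish/subscribe sketch. It is only a toy, in-process stand-in for a message-passing layer and does not reproduce the DACS interface; the class name `MessageBus`, its methods, and the channel name `fixation_3d` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


class MessageBus:
    """Toy in-process message-passing layer (hypothetical, not the DACS API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[Any], None]) -> None:
        """Register a handler that is called for every message on the channel."""
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: Any) -> int:
        """Deliver the message to all handlers; return how many received it."""
        handlers = self._subscribers.get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = MessageBus()
received: List[Any] = []

# The integration module listens for 3D fixation points (channel name assumed).
bus.subscribe("fixation_3d", received.append)

# The attention module publishes without knowing who consumes the message.
delivered = bus.publish("fixation_3d", (0.42, -0.10, 0.05))
```

Because the sender only names a channel, a module written in another language or paradigm can be swapped in without touching the publisher, which is the property the architecture relies on.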

Figure 2 shows a coarse overview of the main
information processing paths. The speech processing
(left) and the attention mechanism (right) provide
linguistic and visual/gestural inputs converging in an
integration module which then passes control to the
manipulator. Additionally, there are control
commands for parts of the system (e.g. on, off, calibrate
skin, park robot arm, ...). The modules and some of
their interactions are further described in the
following sections.

2.1 Hardware Basis 

The vision hardware currently consists of a binocu- 
lar active vision head with two 3-chip-CCD color- 
cameras, controllable pan, tilt, left/right vergence 
and motorized lenses determining focus, zoom and 
aperture, which combine to a total of 10 DOFs. The 
grasping and manipulation is carried out by a stan- 
dard 6DOF PUMA manipulator operated with the 
real-time RCCL-command library. It is additionally 
equipped with a wrist camera to obtain local visual 
feedback during the grasping phase. 

Grasping is carried out by a 9DOF dextrous robot
hand developed at the Technical University of Munich.
It has three approximately human-sized fingers
driven by an oil hydraulics system. The fingertips
have custom-built sensors to provide
force feedback for control and evaluation of the
grasp. The hardware setting and its control design
have been described in more detail elsewhere. Recently
we have changed the original hand design by adding
a palm and rearranging the fingers in a more
human-like configuration (Fig. 8) to allow a larger variety
of two- and three-finger grasps.

3 Visual Attention and Memory 

A necessary prerequisite for successful human- 
machine interaction is to establish and maintain a 
common focus of attention between the user and the 




Figure 3: User speech input ("take the red ... ") 
can bias the attention system towards special features 
(red) and 3D-pointing gestures impose constraints for 
spatial interest regions. 

vision system of the robot. Furthermore, a short
term visual memory has to be realized in order to
understand linguistic reference to objects in spoken
instructions. Our attention system places a high
emphasis on the spatial organization of visual cues and
enhances a previously proposed design. It consists of a
layered system of topographically organized neural
maps for integrating different low-level feature maps
into a continually updated focus of attention for the
active camera head. Similar mechanisms have also
been employed elsewhere; however, only results
for highly idealized synthetic images or using a lower
number and less complex maps are reported.

In particular, from the stereo images a number of fea- 
ture maps indicating the presence of oriented edges, 
HSI-color saturation & intensity, motion (difference 
map), and skin color are computed. As one of the 
main goals of the system is to recognize pointing 
hands, we multiply the difference map (indicating 
movement) by the skin segmentation map (indicating 
a hand). The result is a "moving skin" map, which 
is considered as a separate feature map. A weighted
sum of these feature maps is multiplied by a fadeout
map to form a final attention map and the highest
peak determines the next fixation, see Fig. 3. After
stereo matching, the resulting loop continuously
generates saccades for fixations and this active
exploration behavior persists during the whole system
operation.
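
The map combination described above can be sketched in a few lines. This is a minimal illustration on tiny hand-made grids, not the neural map implementation of the system; the function names, the example weights, and the reading of the fadeout map as a suppression of recently fixated locations are assumptions.

```python
from typing import Dict, List, Tuple

Map = List[List[float]]


def multiply(a: Map, b: Map) -> Map:
    """Coordinate-wise product of two equally sized maps."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def attention_map(features: Dict[str, Map], weights: Dict[str, float], fadeout: Map) -> Map:
    """Weighted sum of the feature maps, multiplied by the fadeout map."""
    rows, cols = len(fadeout), len(fadeout[0])
    total = [[0.0] * cols for _ in range(rows)]
    for name, fmap in features.items():
        w = weights.get(name, 1.0)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += w * fmap[r][c]
    return multiply(total, fadeout)


def next_fixation(att: Map) -> Tuple[int, int]:
    """Return (row, column) of the highest peak of the attention map."""
    cells = ((r, c) for r in range(len(att)) for c in range(len(att[0])))
    return max(cells, key=lambda rc: att[rc[0]][rc[1]])


# Toy 2x3 feature maps (values invented for illustration).
motion = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # difference map
skin = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]     # skin segmentation map
edges = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.2]]

# "Moving skin" is the product of the difference map and the skin map.
moving_skin = multiply(motion, skin)

# Assumed fadeout map: location (0, 1) was fixated recently and is damped.
fadeout = [[1.0, 0.1, 1.0], [1.0, 1.0, 1.0]]

att = attention_map(
    {"edges": edges, "moving_skin": moving_skin},
    {"edges": 1.0, "moving_skin": 2.0},
    fadeout,
)
fixation = next_fixation(att)
```

In this toy case the moving hand at (0, 1) would win without the fadeout map; with it, the next fixation moves on to (1, 2), which gives the explorative behavior described in the text.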

Interaction with the human user can modify the
attention map by two different mechanisms. If a
spoken instruction references a colored object ("... the
red cube ...") the corresponding weight is increased
to bias the attention system towards red spots in the
image. This increases the probability of fixations on
red things, but after some time a decay mechanism
drives the weighting back to a default level.
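
The bias and decay of a color weight can be sketched as follows. The text only states that a decay mechanism exists; the exponential form, the boost value, and the decay rate used here are assumptions chosen for illustration.

```python
def decay_weight(weight: float, default: float, rate: float) -> float:
    """Move the weight a fraction `rate` of the way back towards its default."""
    return default + (1.0 - rate) * (weight - default)


weights = {"red": 1.0}            # default level (assumed value)

# A spoken instruction mentions "red": boost the corresponding weight.
weights["red"] = 3.0

# Each exploration step lets the boost fade (assumed rate of 0.5 per step).
history = []
for _ in range(3):
    weights["red"] = decay_weight(weights["red"], default=1.0, rate=0.5)
    history.append(weights["red"])
```

With these numbers the weight falls from 3.0 to 2.0, 1.5, and 1.25, so the bias towards red fades unless the instruction is repeated.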

If the hand and gesture recognition modules detect
a pointing gesture in the image, the 3D-direction of
the pointing finger is computed and a corresponding
region of interest is virtually projected on the table.
A respective "manipulation map" is multiplied
coordinate-wise with the attention map to restrict
the explorative attention to that region in the next
step, see Fig. 3.
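
A minimal geometric sketch of this step is given below. It assumes a table plane at height z = 0, a circular region of interest, and a grid in table coordinates; none of these details are specified in the text, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_on_table(origin: Vec3, direction: Vec3,
                     table_z: float = 0.0) -> Optional[Tuple[float, float]]:
    """Intersect the pointing ray with the plane z = table_z."""
    dz = direction[2]
    if dz == 0.0:
        return None                      # ray runs parallel to the table
    t = (table_z - origin[2]) / dz
    if t <= 0.0:
        return None                      # table lies behind the fingertip
    return (origin[0] + t * direction[0], origin[1] + t * direction[1])


def manipulation_map(center: Tuple[float, float], radius: float,
                     xs: Sequence[float], ys: Sequence[float]) -> List[List[float]]:
    """1.0 inside the circular region of interest, 0.0 outside."""
    cx, cy = center
    return [[1.0 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0.0
             for x in xs] for y in ys]


# Fingertip 0.4 m above the table, pointing forward and down (assumed values).
hit = project_on_table(origin=(0.0, 0.0, 0.4), direction=(0.5, 0.0, -1.0))

# Sample the region of interest on a coarse table grid.
mask = manipulation_map(hit, radius=0.15,
                        xs=[0.0, 0.1, 0.2, 0.3], ys=[-0.1, 0.0, 0.1])
```

Multiplying such a mask coordinate-wise with the attention map sets everything outside the indicated region to zero, so the next fixation must fall inside it.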

The exploration behavior tends to fixate repetitively 
upon the most interesting points, which are in most 
cases objects. This "emerging regularity" is used to 
establish a short term visual memory in the integra- 
tion module to which all 3D-fixation coordinates are 
sent. It uses temporal integration to stabilize only 
the most salient points and if additionally a homo- 
geneous color blob is detected, it is assumed that 
there is an object, which then can be referenced by 
spoken instructions. Future extensions will add a 
more sophisticated object recognition (already avail- 
able for the grasping feedback) at this point. Also 
we plan to add more specific object maps in the at- 
tention system, which then can be favored by spoken 
instructions exactly like the color maps. 
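
The temporal integration of fixation points can be sketched as a small memory that merges nearby fixations and lets unconfirmed entries fade. The merge distance, decay factor, and stability threshold are invented parameters, and the color blob test is passed in as a placeholder callback.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class MemoryItem:
    position: Point
    hits: float = 1.0


class VisualMemory:
    """Short term memory of 3D fixation points (illustrative sketch)."""

    def __init__(self, merge_dist: float = 0.05, decay: float = 0.9,
                 stable_hits: float = 2.5) -> None:
        self.merge_dist = merge_dist
        self.decay = decay
        self.stable_hits = stable_hits
        self.items: List[MemoryItem] = []

    def add_fixation(self, point: Point) -> None:
        # Age all entries so that rarely fixated points fade out.
        for item in self.items:
            item.hits *= self.decay
        # Reinforce an existing entry if the fixation lands close to it.
        for item in self.items:
            if math.dist(item.position, point) <= self.merge_dist:
                item.hits += 1.0
                return
        self.items.append(MemoryItem(point))

    def stable_objects(self, has_color_blob: Callable[[Point], bool]) -> List[Point]:
        """Entries that are salient over time and coincide with a color blob."""
        return [item.position for item in self.items
                if item.hits >= self.stable_hits and has_color_blob(item.position)]


memory = VisualMemory()
cube = (0.10, 0.20, 0.0)
clutter = (0.50, 0.50, 0.0)
for p in [cube, clutter, (0.11, 0.20, 0.0), cube, cube]:
    memory.add_fixation(p)

objects = memory.stable_objects(lambda position: True)
```

Only the repeatedly fixated location survives as an object hypothesis; the single fixation on clutter stays below the threshold.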

4 Speech and Language 

To allow a fluent communication between the in- 
structor and the artificial communicator our sys- 
tem is capable of understanding speaker independent 
speech input. The instructor neither needs to know 
a special command syntax nor the exact terms or 
identifiers of the objects. Consequently, the complete 
speech understanding system has to face a high de- 
gree of referential uncertainty from vague meanings, 
speech recognition errors, and un-modeled language 
structures. 

Our approach to robust spoken language understanding
uses a vertical organization of knowledge
representation and an integrated processing scheme to
overcome the drawbacks of the traditional horizontal
architecture. As its baseline module it
employs an enhanced statistical speech recognizer. The
recognition process is directly influenced by a partial
parser which provides linguistic and domain-specific
restrictions on word sequences. Therefore, partial
syntactic structures instead of simple word sequences
are generated, e.g. object descriptions ("the red
cube") or spatial relations ("... in front of ..."). These
are combined by the subsequent speech understanding
module to form linguistic interpretations.

To cope with out-of-vocabulary words we employ a
recognition lexicon which exceeds the one used by
the understanding component but covers all lexical
items frequently found in our corpus of human-human
and human-machine dialogs. The syntactic
modeling then allows these additional
words to fill in open lexical categories
such as nouns. In a robust system the
speech processing modules have to be able to cope
with spontaneous speech input which largely deviates
from speech read from text prompts or used in
a dictation task. In particular, clear pronunciation,
vocabulary limitations, and restrictions in language
use can never be enforced. To meet these challenges
the recognition lexicon contains acoustic models for
spontaneous speech phenomena, namely for so-called
human noises (breathing or lip smacks) and hesitations
(like 'uhm').
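
How partial structures, open lexical categories, and hesitations interact can be illustrated with a toy chunker. It works on an already recognized word sequence, whereas in the described system the partial parser influences the recognition process itself; the lexicon, the tag names, and the chunk grammar (determiner, adjectives, noun) are assumptions.

```python
from typing import List

# Tiny illustrative lexicon; words outside it are treated as candidate nouns.
LEXICON = {
    "the": "DET", "a": "DET",
    "red": "ADJ", "long": "ADJ", "thin": "ADJ",
    "cube": "NOUN", "bolt": "NOUN",
    "take": "VERB",
    "uhm": "HESITATION",
}


def tag(word: str) -> str:
    """Look up a word; unknown words may fill the open category of nouns."""
    return LEXICON.get(word, "NOUN?")


def object_descriptions(utterance: str) -> List[str]:
    """Extract partial structures of the form DET ADJ* NOUN."""
    chunks: List[str] = []
    current: List[str] = []
    for word in utterance.lower().split():
        category = tag(word)
        if category == "HESITATION":
            continue                     # skip spontaneous speech phenomena
        if category == "DET":
            current = [word]             # a determiner opens a description
        elif category == "ADJ" and current:
            current.append(word)
        elif category in ("NOUN", "NOUN?") and current:
            current.append(word)
            chunks.append(" ".join(current))
            current = []
        else:
            current = []                 # anything else closes the attempt
    return chunks


known = object_descriptions("take uhm the red cube")
unknown = object_descriptions("take the long thin rod")
```

The second utterance contains the word "rod", which is not in the toy lexicon but is still accepted in the noun position of the object description.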

5 Integration 

5.1 Interrelating Speech and Vision 

If a naive user describes an object in the scene by
using attributes, he or she will typically use a
vocabulary which is different from the fixed one
appropriate for processing of visual data. Therefore,
several kinds of uncertainties have to be considered
when correlating a verbal object description and
object recognition results, such as vague attributes (e.g.
"the long, thin stick"), vague spatial and structural
descriptions (e.g. "the object to the left of the cube",
"the cube with the bolt"), or speech and object
recognition errors.
Figure 4: Bayesian network connecting speech and
vision data for one related reference object. 



In order to cope with this uncertainty, we have
developed a Bayesian network approach that robustly
combines verbal and visual information through
different abstraction levels. On the first level,
basic features from vision ($C_{vis}$: color, $T_{vis}$: elemental
type) and speech ($T$: type, $C$: color, $S$: shape, $Z$: size)
are modeled as evidential nodes of the Bayesian
network for each of the $n$ visual objects and $N$
verbally-referenced objects. On the second level, these are
fused to the visual object class and the verbal
object class $O_{IO/RO_j}$, $j \in \{1,\dots,N-1\}$, which are connected
by the intended object and reference object variables
$IO, RO_j \in \{1,\dots,n\}$. The verbally-mentioned spatial
or structural relations between the objects are
established by introducing additional evidential nodes
Figure 5: Model of the man-machine communication.

$R_j$ (Fig. 4). The different kinds of uncertainties are
modeled by conditional probability tables that have
been estimated from experimental data [22]. The
objects which are denoted in the utterance are those
explaining the observed visual and verbal evidences
$e_{vis}$, $e_{verb}$ in the Bayesian network with the maximum
a posteriori probability. Additional causal support
for an intended object $IO$ is defined by an optional
target region of interest that is provided by
the 3D-pointing evaluation. The intended object $IO$
is then used by the dialog component for system
response and manipulator instruction.
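
The maximum a posteriori selection of the intended object can be illustrated with a strongly simplified sketch. It treats color and type as independent evidence for a single intended object and ignores reference objects and spatial relations, so it is not the network of Fig. 4; all table entries are invented, whereas the system estimates its tables from experimental data.

```python
from typing import List, Optional, Tuple

# Illustrative tables P(spoken word | visually recognized label).
P_COLOR = {
    "red":    {"red": 0.80, "orange": 0.15, "blue": 0.05},
    "orange": {"red": 0.30, "orange": 0.60, "blue": 0.10},
    "blue":   {"red": 0.05, "orange": 0.05, "blue": 0.90},
}
P_TYPE = {
    "cube": {"cube": 0.70, "block": 0.25, "bolt": 0.05},
    "bolt": {"cube": 0.05, "block": 0.05, "bolt": 0.90},
}


def posterior_intended_object(objects: List[Tuple[str, str]],
                              spoken_color: Optional[str],
                              spoken_type: Optional[str],
                              pointing_prior: Optional[List[float]] = None) -> List[float]:
    """Posterior over which visual object the utterance refers to."""
    n = len(objects)
    prior = pointing_prior or [1.0 / n] * n
    scores = []
    for (vis_color, vis_type), p in zip(objects, prior):
        likelihood = p
        if spoken_color is not None:
            likelihood *= P_COLOR[vis_color].get(spoken_color, 0.0)
        if spoken_type is not None:
            likelihood *= P_TYPE[vis_type].get(spoken_type, 0.0)
        scores.append(likelihood)
    total = sum(scores)
    if total == 0.0:
        raise ValueError("evidence is incompatible with every object")
    return [s / total for s in scores]


def map_object(posterior: List[float]) -> int:
    """Index of the object with maximum a posteriori probability."""
    return max(range(len(posterior)), key=posterior.__getitem__)


scene = [("red", "cube"), ("orange", "cube"), ("blue", "bolt")]

# "the red block" without a gesture: a uniform prior over the objects.
speech_only = map_object(posterior_intended_object(scene, "red", "block"))

# The same utterance with a pointing gesture that favors the second object.
with_pointing = map_object(
    posterior_intended_object(scene, "red", "block", pointing_prior=[0.1, 0.8, 0.1]))
```

With speech alone the red cube (index 0) is selected; the pointing prior shifts the decision to the orange cube (index 1), which shows how the gesture acts as additional support for an intended object.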

5.2 Dialog system 

Many dialog systems developed recently lack inte- 
gration with other modalities. In contrast to such 
uni-modal approaches our dialog module integrates 
utterances of the instructor, information of the vis- 
ible scene, and feedback from the robot to realize a 
natural, flexible and robust dialog strategy. 

The dialog module is realized within the semantic
network language Ernest using the dialog model
shown in Fig. 5. The model is based on an
investigation of a corpus of human-human and simulated
human-machine dialogs. Every path through
the model reflects a course of a possible human-machine
dialog. The admissible sequence of intermediate
states is nearly unrestricted, leading to a very
natural and robust dialog behavior.

State transitions are initiated if new information 
from the instructor or the robot is available. The 
state transition function analyzes the new informa- 
tion and combines it with the current dialog con- 
text and information gathered from the interrelation 
module to select the next state. 

Using the dialog context, references between objects
can be resolved (e.g. "Take the red bolt. Put it into the
cube."), and information accumulated in the dialog
can be combined. The dialog module can react to
new information from different modalities to inform 
the instructor about errors during the execution of an 
action and can actively control the dialog to query 
Figure 6: The attention map: hot spots, camera
image, stereo matched points to be transferred to the 
integration module (upper row). Hand and finger 
recognition uses a multi-layer perceptron based clas- 
sification of the intensity histograms (middle row). 
Projection of the 3D-pointing direction on the table 
(lower row). 



for missing or imprecise information. The overall
goal of this module is to continue the dialog in every
situation. Actions which cannot be executed are
immediately rejected. For verbal instructions which
could not be analyzed, a repetition is requested up
to two times. If the dialog has gathered completely
contradictory information, the system expresses its
confusion and asks for a new instruction.
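
The error handling rules stated above can be written down as a small decision policy. The response names are hypothetical, and the behavior after the second failed repetition (asking for a new instruction) is an assumption, since the text does not specify it.

```python
class DialogPolicy:
    """Sketch of the recovery rules of the dialog module."""

    MAX_REPEATS = 2

    def __init__(self) -> None:
        self.repeat_requests = 0

    def respond(self, analyzed: bool, executable: bool = True,
                contradictory: bool = False) -> str:
        if not analyzed:
            # Request a repetition up to two times.
            if self.repeat_requests < self.MAX_REPEATS:
                self.repeat_requests += 1
                return "please_repeat"
            self.repeat_requests = 0
            return "ask_new_instruction"      # assumed fallback
        self.repeat_requests = 0
        if contradictory:
            return "express_confusion_and_ask_new_instruction"
        if not executable:
            return "reject_action"            # rejected immediately
        return "execute"


policy = DialogPolicy()
failed_three_times = [policy.respond(analyzed=False) for _ in range(3)]
rejected = policy.respond(analyzed=True, executable=False)
accepted = policy.respond(analyzed=True)
```

Every branch returns some response, which mirrors the stated goal of continuing the dialog in every situation.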

6 Manipulation 

Once the integration module has resolved ambigui- 
ties, control is passed to the robot arm/hand. Start- 
ing from the 3D-coordinates determined by the vision 
and integration modules the approaching movement 
and grasping is executed in a semi-autonomous fash- 
ion relying on local feedback only. The arm and hand 
control is implemented as a finite state automaton, 
switching between different arm modes (approach, 
refine, closer, re-align,...) and hand states (open, pre- 
shape, grasp, hold, release,...) whose transitions are 
triggered by visual and tactile feedback. In partic- 
ular, the wrist camera provides visual feedback and 
object recognition to approach the grasp offset posi- 
tion and, in the grasping phase, the fingertip sensors 
provide the necessary force feedback. 

The grasping sequence starts with an approach
movement, recenters the manipulator above the object,
chooses a grasp prototype according to the recognized
object, aligns the hand along the main axis
of the object and executes the grasp prototype; for
more details see [20]. After successful gripping, a
similar chain of events allows the robot to put the
object down in another gesturally selected location.
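
The grasping sequence can be sketched as a table-driven finite state automaton. The state names follow the prose loosely, while the event names and the single retry transition are assumptions; the real controller distinguishes separate arm modes and hand states and reacts to continuous visual and tactile feedback.

```python
from typing import Dict, List, Tuple

# (state, event) -> next state. Event names are hypothetical.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("approach", "object_in_view"): "recenter",
    ("recenter", "centered"): "choose_prototype",
    ("choose_prototype", "object_recognized"): "align",
    ("align", "aligned"): "grasp",
    ("grasp", "stable_contact"): "hold",
    ("grasp", "contact_lost"): "approach",    # retry after a failed grasp
}


def run(events: List[str], state: str = "approach") -> List[str]:
    """Feed feedback events to the automaton and record the visited states."""
    trace = [state]
    for event in events:
        # Events that are irrelevant in the current state leave it unchanged.
        state = TRANSITIONS.get((state, event), state)
        trace.append(state)
    return trace


success = run(["object_in_view", "centered", "object_recognized",
               "aligned", "stable_contact"])
retry = run(["object_in_view", "centered", "object_recognized",
             "aligned", "contact_lost"])
```

Keeping the transitions in a table makes it easy to add states or reroute the control flow, for example for the analogous sequence that puts the object down.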

7 An action sequence 

To illustrate the capabilities of our system we present 
a (simplified) typical action sequence for picking up 
and deploying an object in sequential order. Some 
videos can be found at 12 . The sequence consists 
of 8 major stages: 

1) Initially, a number of objects are spread on a table
in the workspace of robot arm and camera. The
system can be started and partially calibrated by
speech commands shown in Fig. 7 (right display).
The attention system explores the scene as shown
in Fig. 6 (top row) and transmits the fixation points
to the integration module, where the visual memory
is stabilized and the spatial object relations are
analyzed, see Fig. 7 (lower left display).
2) A user gives a spoken instruction referencing one
of the objects. The instruction is semantically analyzed
and the dialog is initiated, see Fig. 7 (upper left
display). The system may ask for additional pointing
information, e.g. for resolving ambiguities. It also
determines whether the attention system should be
biased towards particular colors.

3) When a pointing hand is found, the gesture is evaluated
as visualized in Fig. 6 (middle row) and the 3D
interest region is fed to the integration module.

4) The Bayesian network integrates the spoken
instruction, the visual memory, and the gesture-based
region bias to determine the object to be grasped. In
case this fails, the dialog asks for a repetition of the
instruction and a new gesture.

Figure 7: The speech input is analyzed and
segmented into semantic categories like action
(S.AKTION nimm) or object (S.OBJEKT den
Block) (upper left display). Spatial relations between
objects in the short term memory (lower display) can
be referenced by instructions. In the right window a
number of direct commands are available, which can
also be given by spoken instructions.

Figure 8: Visual and tactile feedback for grasping:
a) view through the hand camera for the approach
movements; b) two finger grasp of a cube; c) hand
camera view before three finger grasp; d) three finger
grasp; e) force feedback from the fingertips to evaluate
the grasp.

5) Control is passed to the hand/arm system,
which performs a visually guided approach movement
(Fig. 8a), determines a grasp primitive and
pre-shapes the hand (c), aligns it with respect to the
object and finally grasps the object (b, d) with force
feedback control (e). Upon a failure, it retries and
on success the integration module is informed.

6) The dialog system asks the user to indicate where 
to deploy the object. 

7) The pointing evaluation part of 4) is repeated with
the slight difference that now the 3D-fingertip position
directly determines the position of object
deployment.

8) Control is redirected to the hand/arm system,
which deploys the object and the system returns to
the starting mode of exploration.

8 The Critical Level of Skills 

As described above, our system integrates a large
number of skills, local feedback mechanisms, and
state machines. Many of the modules have been
developed and tested independently of each other,
can be trained offline, and have adaptive calibration
facilities [22]. Most of them are much more
powerful when operated standalone; however, in the
integrated system, the full capabilities of each module
are not always employed. This is due to mutual
interdependencies and less specialized hardware
delivering a lower quality of sensory inputs.
Consequently, a potential of "hidden capabilities" and
resources exists in the overall system.

One approach to avoid this apparent waste of
capabilities is to restrict the solution space for the individual
modules to ensure a high degree of homogeneity
towards a scenario specified beforehand. However,
we find that this is not reasonable because once a
robust functioning of the overall system is achieved, we
can benefit from the "hidden capabilities" in certain
modules quite easily. In our experience, small
coordinated modifications in several modules or slight
changes in the control flow can quickly open new
and unforeseen perspectives for the system. We give
two examples to illustrate this. First, recognition of bars
together with a corresponding grasp prototype
allows us to progress from cube-based pyramid building
to cube and bar based building of bridges, houses,
closed boxes, etc. Secondly, a slight change in the
speech-initiated control allows us to reuse the fingertip
detection algorithm, initially employed to find objects,
for deploying them at fingertip positions. The same
capabilities can be used to teach multi-point trajectories
just by pointing to consecutive positions or to
indicate small relative movements by pointing to two
nearby positions in succession.

We believe that this experience can be summarized
as approaching a critical level of skills. This level
is characterized by a situation where small improvements
or (adaptive) reconfiguration of single modules
or slight changes in the control flow immediately
open up a whole new variety of action opportunities.
Here we benefit from a certain amount of robustness,
the possibility to readapt or recalibrate, and a
rather loose coupling between the modules, which in
our architecture is realized by the message-passing
communication paradigm and which allows quick
reorganization of the control flows. The interactive
teaching of tasks can then take full advantage of the
user's creativity to recombine the system's skills
towards previously unexpected results.

9 Discussion 

The presented architecture integrates a set of
capabilities to enable an intuitive programming of grasping
tasks by a human user. It ranges from a perceptual
grounding in an active exploration of the scene
up to an interpretation of complex user commands
by a sophisticated speech analysis and modality fusion
system. As there are no widely accepted benchmarks
for cognitive robotic systems interacting with
humans, it is difficult to assess the performance of
such systems systematically, beyond demonstrating
by examples that they do indeed run. Thus,
we are currently adding a visualization and monitoring
module, which will also be able to record action
sequences and will enable a more quantitative
performance analysis.

We think one of the major challenges is to lift learn- 
ing in our system from the offline training widely 
used in the lower level modules to the level of be- 
havior. The current system, enhanced by a system 
monitor, will offer a tool to study how such learn- 
ing needs to be organized to progress from imitation 
of human-instructed action sequences to extracting 
knowledge on the task level. Some of the many issues 
will be how to propagate errors top down and how to 
flexibly reorganize the control flow without losing ro- 
bustness and functionality of the system. Only then 
will we come closer to easily-instructable intelligent 
systems that can robustly carry out non-trivial tasks 
in natural environments. 